Rigorous Evaluation for Responsible Deployment
2025-11-01
The Reality Check
The AI and Law research community increasingly emphasises rigorous evaluation of real systems on actual legal text.
Our Investigation:
Our Test
Recent computational legal studies distinguish between “law-as-code” (formal symbolic representations) and “law-as-data” (statistical pattern learning).
The Surprise: GraphDB remains structurally strongest (≈88% GED similarity), but modern AI architectures are closer than many would expect (≈72–74%), so there is no simple ‘AI wins’ or ‘AI fails’ story.
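A rough sketch of how a graph-edit-distance (GED) style similarity like the scores above can be computed. The edge-set approximation and normalisation below are illustrative assumptions, not the study's exact metric, and the example triples are hypothetical:

```python
# Hedged sketch: scoring structural similarity between an extracted rule
# graph and a gold graph. Node operations are ignored and the normaliser
# is an assumption; the study's exact GED formula may differ.
def structural_similarity(gold_edges: set, pred_edges: set) -> float:
    # Edit cost approximated as edges to delete + edges to insert.
    edits = len(gold_edges ^ pred_edges)            # symmetric difference
    max_size = max(len(gold_edges), len(pred_edges), 1)
    return max(0.0, 1.0 - edits / (2 * max_size))

# Hypothetical rule graphs as subject-relation-object triples.
gold = {("fund", "holds", "equity"), ("equity", "min_pct", "80")}
pred = {("fund", "holds", "equity")}
print(f"{structural_similarity(gold, pred):.2f}")  # 0.75
```

Under this scheme a perfect extraction scores 1.0 and missing or spurious edges are penalised symmetrically.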
Strengths:
Limitations:
Strengths:
Limitations:
The Trade-Off
This isn’t about one architecture “winning”—it’s about fundamentally different trade-offs between knowledge engineering costs and output processing complexity. Your choice depends on operational context, not raw accuracy.
Parallel from Medical AI
Recent research at Johns Hopkins shows doctors face similar challenges calibrating reliance on AI recommendations—but this is the first systematic documentation of inverse calibration in regulatory compliance automation.
Research on AI in high-stakes medical decisions has documented “overconfidence” problems. But we discovered something more concerning: systematic inverse calibration.
Expected (Good Calibration):
High Confidence → High Accuracy ✅
When model says “90% confident,” accuracy should be ~90%
Reality (Inverse Calibration):
High Confidence → Low Accuracy ❌
When model says “90% confident,” accuracy is often at its lowest
The Discovery: Negative Confidence-Accuracy Correlation
| Model Architecture | Calibration (r)* | Interpretation |
|---|---|---|
| Few-Shot LLM | -0.545 | 🚨 INVERSE |
| GraphRAG (k=10) | -0.469 | 🚨 INVERSE |
| GraphRAG (k=50) | -0.500 | 🚨 INVERSE |
| Constrained Generation | -0.352 | 🚨 INVERSE |
| Logic-LM | +0.135 | ✅ Positive (weak) |
| Chain-of-Thought | +0.190 | ✅ Positive (weak) |
*Correlation coefficient: +1 = perfect calibration, 0 = no relationship, -1 = inverse calibration
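Correlations like those in the table can be computed with a plain Pearson coefficient over (confidence, correct?) pairs. The data below is synthetic, chosen only to illustrate what a negative r looks like; it is not the study's dataset:

```python
# Illustrative sketch (synthetic data): inverse calibration means
# high-confidence predictions are *less* likely to be correct.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical extraction results: (model confidence, 1 = correct).
results = [
    (0.95, 0), (0.92, 0), (0.90, 0), (0.88, 1),  # high confidence, mostly wrong
    (0.60, 1), (0.55, 1), (0.50, 1), (0.45, 0),  # low confidence, mostly right
]
conf = [c for c, _ in results]
acc = [a for _, a in results]
r = pearson_r(conf, acc)
print(f"confidence-accuracy r = {r:.3f}")  # negative => inverse calibration
```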
Naive confidence-based routing would therefore automate exactly the cases the model is most likely to get wrong, while escalating the cases it handles well.
Example
Rule: “Fund must allocate ≥80% to bonds, ≥30% to equities…”
Model Output:
Key Insight
This is the first systematic characterization of inverse calibration specifically in regulatory text extraction tasks. It has profound implications for safe deployment.
Legal scholars have long recognised that ambiguity serves essential functions—permitting contextual application, enabling evolutionary interpretation, and preserving professional judgement (esposito2021transparency; he2025statutory; dugac2025classifying).
Analysis of 34 FCA Regulations:
pie title "FCA Regulations by Ambiguity"
"Ambiguous Terms (60%)" : 60
"Purely Quantitative (40%)" : 40
The 40% We Can Automate:
The 60% That Resist Automation:
This Is Essential Design
These ambiguous terms serve critical regulatory functions:
Recent computational law research argues that attempting to eliminate ambiguity through rigid formalization would either produce unworkable specificity or drain rules of practical meaning.
Regulation requires some ambiguity.
Given inverse calibration (confidence scores mislead) and pervasive ambiguity (60% of rules resist full automation), we developed an ambiguity-aware routing framework.
Legal Ambiguity Serves Essential Functions
Legal scholars recognise ambiguity isn’t bad drafting—it’s essential design:
We don’t try to eliminate ambiguity—we design around it.
Example: “Fund must hold ≥80% equity securities”
Example: “Fund predominantly invested in equity securities”
Example: “Fund employs appropriate controls for liquidity risk”
❌ Traditional (Fails): routing by model confidence, which inverse calibration makes unreliable
Key Advantage
Doesn’t rely on unreliable AI self-assessment. Routes based on measurable properties of regulatory text that we can verify independently.
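A minimal sketch of what routing on measurable text properties could look like, applied to the three example rules above. The term lists and threshold pattern are illustrative assumptions standing in for whatever lexicon the framework actually uses:

```python
import re

# Illustrative routing sketch: tier assignment from measurable properties
# of the rule text (the term lists below are assumptions, not the
# framework's actual lexicon).
JUDGEMENT_TERMS = {"appropriate", "adequate", "reasonable", "suitable"}
VAGUE_QUANTIFIERS = {"predominantly", "substantially", "mainly", "primarily"}
THRESHOLD = re.compile(r"(≥|≤|>=|<=|\d+(\.\d+)?\s*%)")

def route(rule_text: str) -> str:
    words = set(re.findall(r"[a-z]+", rule_text.lower()))
    if words & JUDGEMENT_TERMS:
        return "human"       # open-textured standard: expert review
    if words & VAGUE_QUANTIFIERS:
        return "hybrid"      # vague quantifier: AI + verification
    if THRESHOLD.search(rule_text):
        return "automatic"   # explicit numeric threshold: AI only
    return "human"           # default to the safest tier

print(route("Fund must hold ≥80% equity securities"))                 # automatic
print(route("Fund predominantly invested in equity securities"))      # hybrid
print(route("Fund employs appropriate controls for liquidity risk"))  # human
```

Because the router inspects the rule text itself rather than a model's self-reported confidence, its decisions can be audited independently of the model.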
Projected Performance (1,000 regulatory submissions):
| Tier | Rules | Processing | Cost/Sub | Human Effort |
|---|---|---|---|---|
| Automatic | 40% | AI only | $1 | 0% |
| Hybrid | 20% | AI + verification | $50 | 10-20 min |
| Human | 40% | Expert review | $200 | Full effort |
| CURRENT (all manual) | 100% | Manual | $300 | 100% |
| PROPOSED (graduated) | 100% | Graduated | $110 | ~40% |
Per Institution (10,000 checks/year):
UK Financial Sector (1,000+ firms):
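Taking the table's per-submission costs at face value, the headline savings work out as follows (all figures are the deck's stated assumptions, not independent estimates):

```python
# Back-of-envelope sketch using the per-submission costs from the table
# above; checks/year and firm count come from the headings.
CURRENT_COST = 300     # $/submission, fully manual
PROPOSED_COST = 110    # $/submission, graduated routing
CHECKS_PER_INSTITUTION = 10_000
FIRMS = 1_000

per_institution_saving = (CURRENT_COST - PROPOSED_COST) * CHECKS_PER_INSTITUTION
sector_saving = per_institution_saving * FIRMS
print(f"per institution: ${per_institution_saving:,}/year")  # $1,900,000/year
print(f"sector-wide:     ${sector_saving:,}/year")           # $1,900,000,000/year
```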
Safety Properties:
✅ High-risk ambiguous rules get human review
✅ Quality maintained through graduated oversight
✅ Transparent routing rationale
✅ Robust to calibration failures
The Confidence Trap Documented
Medical AI shows similar problems, but they had not previously been documented for regulatory compliance
Fair Symbolic-Neural Comparison
Symbolic ≈ Neural (both ~73% accurate) with different strengths
Graduated Automation That Works
Route by ambiguity not confidence scores
Key Achievement:
Systematic empirical assessment of AI for regulatory compliance that documents both capabilities and critical limitations, enabling graduated automation despite inverse calibration.
Contact:
Professor Barry Quinn CStat, PhD
b.quinn@ulster.ac.uk
Resources:
Next Steps:
Discussion Topics:
UKFin+ | UKRI/Innovate UK FE11QUI24